Portable Language Technology: a Resource-light Approach to Morpho-syntactic Tagging
Author
Abstract
Morpho-syntactic tagging is the process of assigning part of speech (POS), case, number, gender, and other morphological information to each word in a corpus, and it is an important step in natural language processing. Morphologically tagged corpora are useful both for linguistic research, e.g. finding instances or frequencies of particular constructions in large corpora, and for further computational processing such as syntactic parsing, speech recognition, stemming, and word-sense disambiguation. Despite the importance of morphological tagging, many languages lack annotated resources, largely because these resources are costly to create. This thesis shows that this expense can be avoided: it describes a method for transferring annotation from a morphologically annotated corpus of a source language to a corpus of a related target language. Unlike unsupervised approaches, which require no annotated data at all and consequently lack precision, the proposed approach relies on linguistic knowledge but avoids large-scale grammar engineering. It needs neither a parallel corpus nor a bilingual lexicon, and requires much less linguistic labor than the standard technology. The dissertation describes experiments with Russian, Czech, Polish, Spanish, Portuguese, and Catalan; however, the general method can be applied to any fusional language.
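To make the notion of a morpho-syntactic tag concrete, the following is a minimal sketch of how a tagged sentence might be represented and queried for feature frequencies. The `MorphTag` class, the feature labels (`nom`, `acc`, `sg`, `fem`), and the `count_feature` helper are hypothetical names chosen for illustration; they are not the tagset or the transfer method used in the thesis.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MorphTag:
    """Illustrative morpho-syntactic tag (hypothetical, not the thesis tagset)."""
    pos: str
    case: Optional[str] = None
    number: Optional[str] = None
    gender: Optional[str] = None


# Hand-tagged Russian example: "Я читаю новую книгу" ("I am reading a new book").
sentence = [
    ("Я", MorphTag(pos="PRON", case="nom", number="sg")),
    ("читаю", MorphTag(pos="VERB", number="sg")),
    ("новую", MorphTag(pos="ADJ", case="acc", number="sg", gender="fem")),
    ("книгу", MorphTag(pos="NOUN", case="acc", number="sg", gender="fem")),
]


def count_feature(tagged, feature):
    """Count the values of one feature, skipping words that do not carry it."""
    values = (getattr(tag, feature) for _, tag in tagged)
    return Counter(v for v in values if v is not None)


case_counts = count_feature(sentence, "case")
print(case_counts)
```

The same kind of query, run over a large tagged corpus, is what makes frequency counts of particular constructions possible.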
Similar resources
An improved joint model: POS tagging and dependency parsing
Dependency parsing is a form of syntactic parsing that automatically analyzes the dependency structure of natural-language sentences, producing a dependency graph for each input sentence. Part-of-speech (POS) tagging is a prerequisite for dependency parsing. Generally, dependency parsers perform POS tagging and dependency parsing in a pipeline mode. Unfortunately, in pipel...
A resource-light approach to morpho-syntactic tagging. Anna Feldman and Jirka Hana
Anna Feldman and Jirka Hana had a problem. Wanting to extract Russian verb frames, they lacked a tool for the necessary first step: morphological analysis of Russian words, disambiguated for context. To avoid the significant overhead of building a contextualized morphological analyzer from scratch, Feldman and Hana wondered if an analyzer that was already available for Czech would perform adeq...
Building an old Occitan corpus via cross-Language transfer
This paper describes the implementation of a resource-light approach, cross-language transfer, to build and annotate a historical corpus for Old Occitan. Our approach transfers morpho-syntactic and syntactic annotation from resource-rich source languages, Old French and Catalan, to a genetically related target language, Old Occitan. The present corpus consists of three sub-corpora in XML format...
A Cross-language Approach to Rapid Creation of New Morpho-syntactically Annotated Resources
We take a novel approach to rapid, low-cost development of morpho-syntactically annotated resources without using parallel corpora or bilingual lexicons. The overall research question is how to exploit language resources and properties to facilitate and automate the creation of morphologically annotated corpora for new languages. This portability issue is especially relevant to minority languag...
Morpho-syntactic tagging system based on the patterns words for arabic texts
Text tagging is a very important tool for various applications in natural language processing, namely the morphological and syntactic analysis of texts, indexing and information retrieval, "vocalization" of Arabic texts, and probabilistic language models (n-class models). However, systems based on a lexicon of limited size are consequently unable to handle unknown words. To overcome thi...